Parallel Algorithms for Hierarchical Nucleus Decomposition
Nucleus decompositions have been shown to be a useful tool for finding dense
subgraphs. The coreness value of a clique represents its density based on the
number of other cliques it is adjacent to. One useful output of nucleus
decomposition is a hierarchy among dense subgraphs at different
resolutions. However, existing parallel algorithms for nucleus decomposition do
not generate this hierarchy, and only compute the coreness values. This paper
presents a scalable parallel algorithm for hierarchy construction, with
practical optimizations, such as interleaving the coreness computation with
hierarchy construction and using a concurrent union-find data structure in an
innovative way to generate the hierarchy. We also introduce a parallel
approximation algorithm for nucleus decomposition, which achieves much lower
span in theory and better performance in practice. We prove strong theoretical
bounds on the work and span (parallel time) of our algorithms.
On a 30-core machine with two-way hyper-threading on real-world graphs, our
parallel hierarchy construction algorithm achieves up to a 58.84x speedup over
the state-of-the-art sequential hierarchy construction algorithm by Sariyuce et
al. and up to a 30.96x self-relative parallel speedup. On the same machine, our
approximation algorithm achieves a 3.3x speedup over our exact algorithm, while
generating coreness estimates with a multiplicative error of 1.33x on average.
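The abstract's use of union-find to build a merge hierarchy can be illustrated with a minimal sequential sketch. The paper's algorithm uses a concurrent union-find interleaved with the coreness computation; the class below is our own simplified, sequential stand-in, and all names in it are hypothetical.

```python
# Minimal sequential sketch: a union-find that records an internal
# hierarchy node each time two components merge at a given coreness
# level. The paper's version is concurrent; this one is illustrative.

class HierarchyUnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
        self.node = list(range(n))   # current hierarchy node of each set
        self.children = {}           # internal node -> (left, right, level)
        self.next_id = n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b, level):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # a new internal hierarchy node covers both merged components
        new = self.next_id
        self.next_id += 1
        self.children[new] = (self.node[ra], self.node[rb], level)
        self.parent[rb] = ra
        self.node[ra] = new

uf = HierarchyUnionFind(4)
uf.union(0, 1, level=3)  # {0} and {1} merge at coreness level 3
uf.union(2, 3, level=3)
uf.union(0, 2, level=1)  # the two pairs merge at the lower level 1
print(len(uf.children))  # -> 3 internal hierarchy nodes
```

Processing merges from the highest coreness level downward in this way yields a dendrogram of dense subgraphs, which is the kind of multi-resolution hierarchy the abstract describes.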
ConnectIt: A Framework for Static and Incremental Parallel Graph Connectivity Algorithms
Connected components is a fundamental kernel in graph applications due to its
usefulness in measuring how well-connected a graph is, as well as its use as
a subroutine in many other graph algorithms. The fastest existing parallel
multicore algorithms for connectivity are based on some form of edge sampling
and/or linking and compressing trees. However, many combinations of these
design choices have been left unexplored. In this paper, we design the
ConnectIt framework, which provides different sampling strategies as well as
various tree linking and compression schemes. ConnectIt enables us to obtain
several hundred new variants of connectivity algorithms, most of which extend
to computing spanning forest. In addition to static graphs, we also extend
ConnectIt to support mixes of insertions and connectivity queries in the
concurrent setting.
We present an experimental evaluation of ConnectIt on a 72-core machine,
which we believe is the most comprehensive evaluation of parallel connectivity
algorithms to date. Compared to a collection of state-of-the-art static
multicore algorithms, we obtain an average speedup of 37.4x (2.36x average
speedup over the fastest existing implementation for each graph). Using
ConnectIt, we are able to compute connectivity on the largest
publicly-available graph (with over 3.5 billion vertices and 128 billion edges)
in under 10 seconds using a 72-core machine, providing a 3.1x speedup over the
fastest existing connectivity result for this graph, in any computational
setting. For our incremental algorithms, we show that our algorithms can ingest
graph updates at up to several billion edges per second. Finally, to guide the
user in selecting the best variants in ConnectIt for different situations, we
provide a detailed analysis of the different strategies in terms of their work
and locality.
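The design space ConnectIt explores, pluggable tree-linking and compression rules inside a union-find-based connectivity computation, can be sketched sequentially. The two rules below (link-by-index, path halving) are just example choices of ours; ConnectIt combines many more rules with sampling phases in a concurrent setting.

```python
# Toy sequential sketch of ConnectIt's design space: connectivity via
# union-find with pluggable linking and compression rules. The rule
# names here are illustrative, not ConnectIt's API.

def connectivity(n, edges, compress="halving"):
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            if compress == "halving":
                parent[x] = parent[parent[x]]  # shorten path while walking
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            # "link by index": the lower-numbered root becomes the parent
            if ru < rv:
                parent[rv] = ru
            else:
                parent[ru] = rv
    return [find(x) for x in range(n)]

labels = connectivity(5, [(0, 1), (1, 2), (3, 4)])
print(labels)  # -> [0, 0, 0, 3, 3]
```

Swapping in different linking rules (e.g. by rank or by size) and compression rules (e.g. splitting or full compression) yields the kind of variant family the framework enumerates, here without the concurrency and sampling that make the real variants fast.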
Parallel Index-Based Structural Graph Clustering and Its Approximation
SCAN (Structural Clustering Algorithm for Networks) is a well-studied, widely
used graph clustering algorithm. For large graphs, however, sequential SCAN
variants are prohibitively slow, and parallel SCAN variants do not effectively
share work among queries with different SCAN parameter settings. Since users of
SCAN often explore many parameter settings to find good clusterings, it is
worthwhile to precompute an index that speeds up queries.
This paper presents a practical and provably efficient parallel index-based
SCAN algorithm based on GS*-Index, a recent sequential algorithm. Our parallel
algorithm improves upon the asymptotic work of the sequential algorithm by
using integer sorting. It is also highly parallel, achieving logarithmic span
(parallel time) for both index construction and clustering queries.
Furthermore, we apply locality-sensitive hashing (LSH) to design a novel
approximate SCAN algorithm and prove guarantees for its clustering behavior.
We present an experimental evaluation of our algorithms on large real-world
graphs. On a 48-core machine with two-way hyper-threading, our parallel index
construction achieves a 50--151x speedup over the construction of
GS*-Index. In fact, even on a single thread, our index construction algorithm
is faster than GS*-Index. Our parallel index query implementation achieves
a 5--32x speedup over GS*-Index queries across a range of SCAN parameter
values, and our implementation is always faster than ppSCAN, a state-of-the-art
parallel SCAN algorithm. Moreover, our experiments show that applying LSH
results in faster index construction while maintaining good clustering quality.
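SCAN's core quantity is the structural similarity of adjacent vertices, |N[u] ∩ N[v]| / sqrt(|N[u]| · |N[v]|) over closed neighborhoods. The sketch below computes it exactly and pairs it with an illustrative MinHash estimate of the related Jaccard similarity of two neighborhoods; the paper's actual LSH scheme and its clustering guarantees differ in detail, and all function names here are ours.

```python
# Exact SCAN structural similarity, plus a MinHash-style estimate of
# the Jaccard similarity of two neighborhood sets. Illustrative only;
# the paper's LSH-based approximation is more involved.

import random

def structural_similarity(adj, u, v):
    nu = adj[u] | {u}  # closed neighborhood of u
    nv = adj[v] | {v}
    return len(nu & nv) / (len(nu) * len(nv)) ** 0.5

def minhash_jaccard(a, b, trials=200, seed=0):
    rng = random.Random(seed)
    universe = sorted(a | b)
    hits = 0
    for _ in range(trials):
        perm = universe[:]
        rng.shuffle(perm)
        first = perm[0]  # minimum element under this random permutation
        hits += first in a and first in b
    return hits / trials  # unbiased estimate of |a & b| / |a | b|

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(round(structural_similarity(adj, 0, 2), 3))  # -> 0.866
```

The MinHash estimate converges to the true Jaccard similarity (0.75 for the closed neighborhoods of vertices 0 and 2 above) as the number of trials grows, which is the basic mechanism that lets sketching stand in for exact set intersections during index construction.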
Parallel Integer Sort: Theory and Practice
Integer sorting is a fundamental problem in computer science. This paper
studies parallel integer sort both in theory and in practice. In theory, we
show tighter bounds for a class of existing practical integer sort algorithms,
which provides a solid theoretical foundation for their widespread usage in
practice and strong performance. In practice, we design a new integer sorting
algorithm, \textsf{DovetailSort}, that is theoretically-efficient and has good
practical performance.
In particular, \textsf{DovetailSort} overcomes a common challenge in existing
parallel integer sorting algorithms, which is the difficulty of detecting and
taking advantage of duplicate keys. The key insight in \textsf{DovetailSort} is
to combine algorithmic ideas from both integer- and comparison-sorting
algorithms. In our experiments, \textsf{DovetailSort} achieves competitive or
better performance than existing state-of-the-art parallel integer and
comparison sorting algorithms on various synthetic and real-world datasets.
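The duplicate-key idea, as we read the abstract, is to detect heavy (frequently repeated) keys so they do not inflate integer-sort buckets, and to treat them separately. The toy sequential sketch below samples to find heavy keys, counts their runs directly, and sorts the remaining light keys; the function and parameter names are ours, and the real \textsf{DovetailSort} is a parallel algorithm with very different internals.

```python
# Toy sequential sketch of handling heavy duplicate keys separately:
# sample to detect heavy keys, pre-count their runs, sort the light
# keys (here with sorted() as a stand-in for a radix sort), and merge.

import random
from collections import Counter

def sort_with_heavy_keys(keys, sample_size=100, threshold=0.1, seed=0):
    rng = random.Random(seed)
    sample = [rng.choice(keys) for _ in range(sample_size)]
    heavy = {k for k, c in Counter(sample).items()
             if c / sample_size >= threshold}
    light = sorted(k for k in keys if k not in heavy)
    heavy_runs = Counter(k for k in keys if k in heavy)
    # merge the pre-counted heavy runs back among the sorted light keys
    out, i = [], 0
    for k, c in sorted(heavy_runs.items()):
        while i < len(light) and light[i] < k:
            out.append(light[i]); i += 1
        out.extend([k] * c)
    out.extend(light[i:])
    return out

data = [5, 1, 5, 3, 5, 2, 5, 4] * 10
print(sort_with_heavy_keys(data) == sorted(data))  # -> True
```

The output is correct whether or not a key is flagged heavy; the flag only changes which path handles it, which is why sampling-based detection is safe even when it misclassifies.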
Fast, Parallel, and Cache-Friendly Suffix Array Construction
String indexes such as the suffix array (SA) and the closely related longest common prefix (LCP) array are fundamental objects in bioinformatics and have a wide variety of applications. Despite their importance in practice, few scalable parallel algorithms for constructing these indexes are known, and the existing algorithms can be highly non-trivial to implement and parallelize. In this paper, we present CaPS-SA, a simple and scalable parallel algorithm for constructing these string indexes, inspired by samplesort. Due to its design, CaPS-SA has excellent memory locality and thus incurs fewer cache misses and achieves strong performance on modern multicore systems with deep cache hierarchies. We show that despite its simple design, CaPS-SA outperforms existing state-of-the-art parallel SA and LCP-array construction algorithms on modern hardware. Finally, motivated by applications in modern aligners where the query strings have bounded lengths, we introduce the notion of a bounded-context SA and show that CaPS-SA can easily be extended to exploit this structure to obtain further speedups.
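The two indexes named in the abstract have a compact definition in code. The naive sequential sketch below just sorts suffixes and scans for longest common prefixes; CaPS-SA itself is a parallel, samplesort-based, cache-friendly construction, so this only fixes the definitions, not the method.

```python
# Naive reference construction of the suffix array (SA) and LCP array.
# sa[r] is the start position of the r-th suffix in sorted order;
# lcp[r] is the longest common prefix length of the suffixes at
# ranks r-1 and r. O(n^2 log n); for definition purposes only.

def suffix_and_lcp_arrays(s):
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    lcp = [0] * len(s)
    for r in range(1, len(s)):
        a, b = s[sa[r - 1]:], s[sa[r]:]
        while lcp[r] < min(len(a), len(b)) and a[lcp[r]] == b[lcp[r]]:
            lcp[r] += 1
    return sa, lcp

sa, lcp = suffix_and_lcp_arrays("banana")
print(sa)   # -> [5, 3, 1, 0, 4, 2]
print(lcp)  # -> [0, 1, 3, 0, 0, 2]
```

The bounded-context variant mentioned in the abstract would, as we understand it, only need suffixes ordered up to a fixed comparison depth, which is what lets bounded-length queries be served from a cheaper index.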
Parallel Graph Algorithms in Constant Adaptive Rounds: Theory meets Practice
We study fundamental graph problems such as graph connectivity, minimum
spanning forest (MSF), and approximate maximum (weight) matching in a
distributed setting. In particular, we focus on the Adaptive Massively Parallel
Computation (AMPC) model, which is a theoretical model that captures
MapReduce-like computation augmented with a distributed hash table.
We show the first AMPC algorithms for all of the studied problems that run in
a constant number of rounds and use only $O(n^{\epsilon})$ space per machine,
where $0 < \epsilon < 1$. Our results improve both upon the previous results in
the AMPC model, as well as the best-known results in the MPC model, which is
the theoretical model underpinning many popular distributed computation
frameworks, such as MapReduce, Hadoop, Beam, Pregel and Giraph.
Finally, we provide an empirical comparison of the algorithms in the MPC and
AMPC models in a fault-tolerant distributed computation environment. We
empirically evaluate our algorithms on a set of large real-world graphs and
show that our AMPC algorithms can achieve improvements in both running time and
round-complexity over optimized MPC baselines.